INTERSPEECH.2023 - Speech Synthesis

Total: 117

#1 Emotional Talking Head Generation based on Memory-Sharing and Attention-Augmented Networks

Authors: Jianrong Wang ; Yaxin Zhao ; Li Liu ; Tianyi Xu ; Qi Li ; Sen Li

Given an audio clip and a reference face image, the goal of talking head generation is to generate a high-fidelity talking head video. Although audio-driven methods for generating talking head videos have made progress, most of them focus only on lip-audio synchronization and lack the ability to reproduce the facial expressions of the target person. To this end, we propose a talking head generation model consisting of a Memory-Sharing Emotion Feature extractor (MSEF) and an Attention-Augmented Translator based on U-net (AATU). Firstly, MSEF extracts implicit emotional auxiliary features from audio to estimate more accurate emotional face landmarks. Secondly, AATU acts as a translator between the estimated landmarks and photo-realistic video frames. Extensive qualitative and quantitative experiments show the superiority of the proposed method over previous works. Codes will be made publicly available.

#2 Speech Synthesis with Self-Supervisedly Learnt Prosodic Representations

Authors: Zhao-Ci Liu ; Zhen-Hua Ling ; Ya-Jun Hu ; Jia Pan ; Jin-Wei Wang ; Yun-Di Wu

This paper presents S4LPR, a Speech Synthesis model conditioned on Self-Supervisedly Learnt Prosodic Representations. Instead of using raw acoustic features, such as F0 and energy, as intermediate prosodic variables, three self-supervised speech models are designed for comparison and are pre-trained on large-scale unlabeled data to extract frame-level prosodic representations. In addition to vanilla wav2vec 2.0, the other two pre-trained models learn representations from LPC residuals or adopt a multi-task learning strategy to focus on the prosodic information in speech. Based on FastSpeech2 and PnGBERT, our acoustic model is built with the learned prosodic representations as intermediate variables. Experimental results demonstrate that the naturalness of speech synthesized using S4LPR is significantly better than the FastSpeech2 baseline.

#3 EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis

Authors: Haobin Tang ; Xulong Zhang ; Jianzong Wang ; Ning Cheng ; Jing Xiao

There has been significant progress in emotional Text-To-Speech (TTS) synthesis technology in recent years. However, existing methods primarily focus on synthesizing a limited number of emotion types and achieve unsatisfactory performance in intensity control. To address these limitations, we propose EmoMix, which can generate emotional speech with a specified intensity or a mixture of emotions. Specifically, EmoMix is a controllable emotional TTS model based on a diffusion probabilistic model and a pre-trained speech emotion recognition (SER) model used to extract emotion embeddings. Mixed-emotion synthesis is achieved by combining the noises predicted by the diffusion model conditioned on different emotions within a single sampling process at run time. We further mix the Neutral emotion with a specific primary emotion in varying degrees to control intensity. Experimental results validate the effectiveness of EmoMix for synthesizing mixed emotions and controlling intensity.
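
A minimal sketch of the noise-blending idea described in this abstract, not the authors' implementation: at each reverse-diffusion step, noise predictions conditioned on two emotion embeddings are interpolated with a mixing weight. The noise-prediction stub, embeddings, and schedule constants below are placeholders.

```python
# Illustrative sketch of mixed-emotion sampling in a DDPM-style sampler.
import numpy as np

rng = np.random.default_rng(0)
T = 50
betas = np.linspace(1e-4, 0.05, T)        # placeholder noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def predict_noise(x_t, t, emotion_emb):
    # Stand-in for the diffusion model's noise-prediction network.
    return 0.1 * x_t + 0.01 * emotion_emb

def mixed_step(x_t, t, emb_a, emb_b, alpha_mix=0.7):
    eps_a = predict_noise(x_t, t, emb_a)                 # e.g. conditioned on "Happy"
    eps_b = predict_noise(x_t, t, emb_b)                 # e.g. conditioned on "Neutral"
    eps = alpha_mix * eps_a + (1.0 - alpha_mix) * eps_b  # blend noises -> mixed emotion
    # Standard DDPM posterior mean (stochastic noise term omitted for brevity).
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    return (x_t - coef * eps) / np.sqrt(alphas[t])

x = rng.standard_normal(80)                              # latent for one mel frame (placeholder)
emb_happy, emb_neutral = rng.standard_normal(80), np.zeros(80)
for t in reversed(range(T)):
    x = mixed_step(x, t, emb_happy, emb_neutral)
```

Setting alpha_mix closer to 0 or 1 would correspond to leaning toward one of the two conditioning emotions; blending with Neutral in this way is one reading of the intensity-control mechanism.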

#4 Laughter Synthesis using Pseudo Phonetic Tokens with a Large-scale In-the-wild Laughter Corpus

Authors: Detai Xin ; Shinnosuke Takamichi ; Ai Morimatsu ; Hiroshi Saruwatari

We present a large-scale in-the-wild Japanese laughter corpus and a laughter synthesis method. Previous work on laughter synthesis lacks not only data but also proper ways to represent laughter. To solve these problems, we first propose an in-the-wild corpus comprising 3.5 hours of laughter, which is, to the best of our knowledge, the largest laughter corpus designed for laughter synthesis. We then propose pseudo phonetic tokens (PPTs) to represent laughter as a sequence of discrete tokens, which are obtained by training a clustering model on features extracted from laughter by a pretrained self-supervised model. Laughter can then be synthesized by feeding PPTs into a text-to-speech system. We further show that PPTs can be used to train a language model for unconditional laughter generation. Results of comprehensive subjective and objective evaluations demonstrate that the proposed method significantly outperforms a baseline method and can generate natural laughter unconditionally.
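
A hedged sketch of how such pseudo phonetic tokens might be derived, assuming k-means clustering of frame-level self-supervised features; the feature extractor, cluster count, and deduplication step below are assumptions, not details from the paper.

```python
# Derive discrete pseudo-phonetic tokens by clustering frame-level SSL features.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.standard_normal((500, 768))    # placeholder for SSL features of laughter frames

kmeans = KMeans(n_clusters=64, n_init=10, random_state=0).fit(features)
frame_tokens = kmeans.predict(features)       # one discrete token per frame

# Collapse runs of identical tokens into a compact sequence a TTS system could consume.
ppt_sequence = [int(t) for i, t in enumerate(frame_tokens)
                if i == 0 or t != frame_tokens[i - 1]]
print(ppt_sequence[:20])
```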

#5 Explicit Intensity Control for Accented Text-to-speech

Authors: Rui Liu ; Haolin Zuo ; De Hu ; Guanglai Gao ; Haizhou Li

Accented text-to-speech (TTS) synthesis seeks to generate speech with an accent (L2) as a variant of the standard version (L1). How to control the intensity of the accent is an interesting research direction. Recent works design a speaker-adversarial loss to disentangle speaker and accent information, and then adjust the loss weight to control the accent intensity. However, there is no direct correlation between the disentanglement factor and natural accent intensity. To this end, this paper proposes a new intuitive and explicit accent intensity control scheme for accented TTS. Specifically, we first extract the posterior probability from an L1 speech recognition model to quantify phoneme-level accent intensity for accented speech, and then design a FastSpeech2-based TTS model, named Ai-TTS, that takes the accent intensity expression into account during speech generation. Experiments show that our method outperforms the baseline model in terms of accent rendering and intensity control.
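
One plausible reading of the intensity quantification, sketched with synthetic numbers: the less confidently an L1 recognizer assigns the canonical phoneme, the stronger the accent on that phoneme. The logits, vocabulary size, and canonical phoneme ids below are placeholders, not the authors' model outputs.

```python
# Phoneme-level accent intensity as (1 - posterior of the canonical phoneme).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
num_phones, vocab = 5, 40
logits = rng.standard_normal((num_phones, vocab))   # stand-in for L1 ASR phoneme logits
canonical = np.array([3, 17, 8, 25, 3])             # canonical phoneme ids of the L2 utterance

posteriors = softmax(logits)
accent_intensity = 1.0 - posteriors[np.arange(num_phones), canonical]
print(np.round(accent_intensity, 3))                # one intensity value per phoneme
```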

#6 Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech

Authors: Guangyan Zhang ; Thomas Merritt ; Sam Ribeiro ; Biel Tura-Vecino ; Kayoko Yanagisawa ; Kamil Pokora ; Abdelhamid Ezzerg ; Sebastian Cygert ; Ammar Abbas ; Piotr Bilinski ; Roberto Barra-Chicote ; Daniel Korzekwa ; Jaime Lorenzo-Trueba

Neural text-to-speech systems are often optimized with L1/L2 losses, which make strong assumptions about the distribution of the target data space. Aiming to improve on those assumptions, normalizing flows and diffusion probabilistic models were recently proposed as alternatives. In this paper, we compare traditional L1/L2-based approaches to diffusion- and flow-based approaches for the tasks of prosody and mel-spectrogram prediction in text-to-speech synthesis. We use a prosody model to generate log-f0 and duration features, which are used to condition an acoustic model that generates mel-spectrograms. Experimental results demonstrate that the flow-based model achieves the best performance for spectrogram prediction, improving over equivalent diffusion and L1 models. Meanwhile, both diffusion- and flow-based prosody predictors result in significant improvements over a typical L2-trained prosody model.

#7 FACTSpeech: Speaking a Foreign Language Pronunciation Using Only Your Native Characters

Authors: Hong-Sun Yang ; Ji-Hoon Kim ; Yoon-Cheol Ju ; Il-Hwan Kim ; Byeong-Yeol Kim ; Shuk-Jae Choi ; Hyung-Yong Kim

Recent text-to-speech models are increasingly required to synthesize natural speech from language-mixed sentences, which are common in real-world applications. However, most models do not consider transliterated words as input. When generating speech from transliterated text, it is not always natural to pronounce transliterated words as they are written, as in the case of song titles. To address this issue, we introduce FACTSpeech, a system that can synthesize natural speech from transliterated text while allowing users to control the pronunciation between the native and literal languages. Specifically, we propose a new language shift embedding to control the pronunciation of input text between native and literal pronunciation. Moreover, we leverage conditional instance normalization to improve pronunciation while preserving speaker identity. The experimental results show that FACTSpeech generates native speech even from sentences in transliterated form.

#8 Cross-Lingual Transfer Learning for Phrase Break Prediction with Multilingual Language Model

Authors: Hoyeon Lee ; Hyun-Wook Yoon ; Jong-Hwan Kim ; Jae-Min Kim

Phrase break prediction is a crucial task for improving the prosodic naturalness of a text-to-speech (TTS) system. However, most proposed phrase break prediction models are monolingual, trained exclusively on a large amount of labeled data. In this paper, we address this issue for low-resource languages with limited labeled data using cross-lingual transfer. We investigate the effectiveness of zero-shot and few-shot cross-lingual transfer for phrase break prediction using a pre-trained multilingual language model. We use manually collected datasets in four Indo-European languages: one high-resource language and three with limited resources. Our findings demonstrate that cross-lingual transfer learning can be a particularly effective approach, especially in the few-shot setting, for improving performance in low-resource languages. This suggests that cross-lingual transfer can be an inexpensive and effective way to develop TTS front-ends for resource-poor languages.

#9 DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech

Authors: Sen Liu ; Yiwei Guo ; Chenpeng Du ; Xie Chen ; Kai Yu

Although high-fidelity speech can be obtained for intralingual speech synthesis, cross-lingual text-to-speech (CTTS) is still far from satisfactory, as it is difficult to accurately retain the speaker timbre (i.e., speaker similarity) and eliminate the accent from the speaker's first language (i.e., nativeness). In this paper, we demonstrate that vector-quantized (VQ) acoustic features contain less speaker information than mel-spectrograms. Based on this finding, we propose a novel dual speaker embedding TTS (DSE-TTS) framework for CTTS with an authentic speaking style. Here, one embedding is fed to the acoustic model to learn the linguistic speaking style, while the other is integrated into the vocoder to mimic the target speaker's timbre. Experiments show that by combining both embeddings, DSE-TTS significantly outperforms the state-of-the-art SANE-TTS in cross-lingual synthesis, especially in terms of nativeness.

#10 Generating Multilingual Gender-Ambiguous Text-to-Speech Voices

Authors: Konstantinos Markopoulos ; Georgia Maniati ; Georgios Vamvoukakis ; Nikolaos Ellinas ; Georgios Vardaxoglou ; Panos Kakoulidis ; Junkwang Oh ; Gunu Jho ; Inchul Hwang ; Aimilios Chalamandaris ; Pirros Tsiakoulis ; Spyros Raptis

The gender of any voice user interface is a key element of its perceived identity. Recently there has been increasing interest in interfaces where the gender is ambiguous rather than clearly identifying as female or male. This work addresses the task of generating novel gender-ambiguous TTS voices in a multi-speaker, multilingual setting. This is accomplished by efficiently sampling from a latent speaker embedding space using a proposed gender-aware method. Extensive objective and subjective evaluations clearly indicate that this method is able to efficiently generate a range of novel, diverse voices that are consistent and perceived as more gender-ambiguous than a baseline voice across all the languages examined. Interestingly, the gender perception is found to be robust across two demographic factors of the listeners: native language and gender. To our knowledge, this is the first systematic and validated approach that can reliably generate a variety of gender-ambiguous voices.
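
The abstract does not spell out the gender-aware sampling method, so the following is only one plausible interpolation-based strategy, sketched under that assumption; the embedding dimensions, centroids, and jitter are placeholders.

```python
# Sample novel gender-ambiguous voices near the midpoint of gendered centroids
# in a speaker-embedding space (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
male_embs = rng.normal(loc=-1.0, size=(50, 256))      # placeholder speaker embeddings
female_embs = rng.normal(loc=+1.0, size=(50, 256))

centroid_m, centroid_f = male_embs.mean(axis=0), female_embs.mean(axis=0)

def sample_ambiguous(n=4, spread=0.05):
    # Points near the midpoint of the two centroids, jittered for diversity.
    w = rng.uniform(0.4, 0.6, size=(n, 1))
    base = w * centroid_m + (1 - w) * centroid_f
    return base + spread * rng.standard_normal(base.shape)

new_voices = sample_ambiguous()
print(new_voices.shape)                               # (4, 256) novel voice embeddings
```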

#11 RAD-MMM: Multilingual Multiaccented Multispeaker Text To Speech

Authors: Rohan Badlani ; Rafael Valle ; Kevin J. Shih ; João Felipe Santos ; Siddharth Gururani ; Bryan Catanzaro

We create a multilingual speech synthesis system that can generate speech with a native accent in any seen language while retaining the characteristics of an individual's voice. It is expensive to obtain bilingual training data for a speaker, and the lack of such data results in strong correlations that entangle speaker, language, and accent, leading to poor transfer capabilities. To overcome this, we present RADMMM, a speech synthesis model based on RADTTS with explicit control over accent, language, speaker, and fine-grained F0 and energy features. Our proposed model does not rely on bilingual training data. We demonstrate the ability to control the synthesized accent for any speaker in an open-source dataset comprising 7 languages, with one native speaker per language. Human subjective evaluation demonstrates that, compared to controlled baselines, our model better retains a speaker's voice and target accent while synthesizing fluent speech in all target languages and accents in our dataset.

#12 Multilingual context-based pronunciation learning for Text-to-Speech

Authors: Giulia Comini ; Sam Ribeiro ; Fan Yang ; Heereen Shim ; Jaime Lorenzo-Trueba

Phonetic information and linguistic knowledge are essential components of a Text-to-speech (TTS) front-end. For a given language, a lexicon can be collected offline, and Grapheme-to-Phoneme (G2P) relationships are usually modeled to predict the pronunciation of out-of-vocabulary (OOV) words. Additionally, post-lexical phonology, often defined in the form of rule-based systems, is used to correct pronunciation within or between words. In this work we showcase a multilingual unified front-end system that addresses any pronunciation-related task typically handled by separate modules. We evaluate the proposed model on G2P conversion and other language-specific challenges, such as homograph and polyphone disambiguation, post-lexical rules, and implicit diacritization. We find that the multilingual model is competitive across languages and tasks; however, some trade-offs exist when compared to equivalent monolingual solutions.

#13 Listener sensitivity to deviating obstruents in WaveNet

Authors: Ayushi Pandey ; Jens Edlund ; Sébastien Le Maguer ; Naomi Harte

This paper investigates the perceptual significance of the deviation in obstruents previously observed in WaveNet vocoders. The study involved presenting stimuli of varying lengths to 128 participants, who were asked to identify whether each stimulus was produced by a human or a machine. The participants' responses were captured using a 2-alternative forced choice task. The study found that while the length of the stimuli did not reliably affect participants' accuracy in the task, the concentration of obstruents did have a significant effect. Participants were consistently more accurate in identifying WaveNet stimuli as machine when the phrases were obstruent-rich. These findings show that the deviation in obstruents reported in WaveNet voices is perceivable by human listeners. The test protocol may be of wider utility in TTS.

#14 How Generative Spoken Language Modeling Encodes Noisy Speech: Investigation from Phonetics to Syntactics

Authors: Joonyong Park ; Shinnosuke Takamichi ; Tomohiko Nakamura ; Kentaro Seki ; Detai Xin ; Hiroshi Saruwatari

We examine the speech modeling potential of generative spoken language modeling (GSLM), which uses learned symbols derived from data rather than phonemes for speech analysis and synthesis. Since GSLM facilitates textless spoken language processing, exploring its effectiveness is critical for paving the way for novel paradigms in spoken-language processing. This paper presents findings on GSLM's encoding and decoding effectiveness at the spoken-language and speech levels. Through speech resynthesis experiments, we reveal that resynthesis errors occur at levels ranging from phonology to syntactics, and that GSLM frequently resynthesizes natural but content-altered speech.

#15 MOS vs. AB: Evaluating Text-to-Speech Systems Reliably Using Clustered Standard Errors

Authors: Joshua Camp ; Tom Kenter ; Lev Finkelstein ; Rob Clark

The quality of synthetic speech is typically evaluated using subjective listening tests. An underlying assumption is that these tests are reliable, i.e., running the test multiple times gives consistent results. A common approach to study reliability is a replication study. Existing studies focus primarily on Mean Opinion Score (MOS), and few consider the error bounds from the original test. In contrast, we present a replication study of both MOS and AB preference tests to answer two questions: (1) which of the two test types is more reliable for system comparison, and (2) for both test types, how reliable are the results with respect to their estimated standard error? We find that while AB tests are more reliable for system comparison, standard errors are underestimated for both test types. We show that these underestimates are partially due to broken independence assumptions, and suggest alternate methods of standard error estimation that account for dependencies among ratings.
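
A worked toy example of the independence point, with simulated data rather than the paper's: when ratings share a within-listener component, the naive standard error of the mean understates uncertainty relative to a listener-clustered estimate. The group sizes and variance parameters below are arbitrary.

```python
# Naive vs. listener-clustered standard error of a mean rating (simulated MOS data).
import numpy as np

rng = np.random.default_rng(0)
n_listeners, n_items = 30, 20
listener_bias = rng.normal(0, 0.5, n_listeners)                 # shared within-listener effect
ratings = 3.5 + listener_bias[:, None] + rng.normal(0, 0.7, (n_listeners, n_items))

flat = ratings.ravel()
naive_se = flat.std(ddof=1) / np.sqrt(flat.size)                # treats all ratings as independent

cluster_means = ratings.mean(axis=1)                            # one mean per listener (cluster)
clustered_se = cluster_means.std(ddof=1) / np.sqrt(n_listeners) # accounts for within-listener dependence

print(f"naive SE = {naive_se:.3f}, clustered SE = {clustered_se:.3f}")
```

With correlated ratings the clustered estimate comes out noticeably larger, which is the sense in which standard errors are underestimated when dependencies among ratings are ignored.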

#16 RAMP: Retrieval-Augmented MOS Prediction via Confidence-based Dynamic Weighting

Authors: Hui Wang ; Shiwan Zhao ; Xiguang Zheng ; Yong Qin

Automatic Mean Opinion Score (MOS) prediction is crucial for evaluating the perceptual quality of synthetic speech. While recent approaches using pre-trained self-supervised learning (SSL) models have shown promising results, they only partly address the data scarcity issue for the feature extractor. This leaves the data scarcity issue for the decoder unresolved, leading to suboptimal performance. To address this challenge, we propose a retrieval-augmented MOS prediction method, dubbed RAMP, to enhance the decoder's ability to cope with data scarcity. A fusing network is also proposed to dynamically adjust the retrieval scope for each instance and the fusion weights based on predictive confidence. Experimental results show that our proposed method outperforms existing methods in multiple scenarios.
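
The fusion details are not given in this abstract, so the sketch below shows only a generic confidence-weighted blend of a parametric prediction with a k-nearest-neighbour MOS estimate; all stored data, functions, and weights are hypothetical.

```python
# Blend a parametric MOS prediction with a retrieval-based estimate, weighted by confidence.
import numpy as np

rng = np.random.default_rng(0)
train_feats = rng.standard_normal((1000, 256))      # stored embeddings of labelled utterances
train_mos = rng.uniform(1, 5, 1000)                 # their ground-truth MOS labels

def retrieve_mos(query_feat, k=8):
    # Mean MOS of the k closest training utterances in embedding space.
    d = np.linalg.norm(train_feats - query_feat, axis=1)
    return train_mos[np.argsort(d)[:k]].mean()

def fuse(param_pred, param_conf, query_feat):
    w = np.clip(param_conf, 0.0, 1.0)               # high confidence -> trust the decoder more
    return w * param_pred + (1 - w) * retrieve_mos(query_feat)

print(round(fuse(3.8, 0.6, rng.standard_normal(256)), 2))
```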

#17 Can Better Perception Become a Disadvantage? Synthetic Speech Perception in Congenitally Blind Users

Authors: Gerda Ana Melnik-Leroy ; Gediminas Navickas

Modern Text-To-Speech systems are rarely tested on non-standard user groups, such as people with impairments. Nevertheless, evidence suggests that some of these groups might perceive synthetic speech differently (better or worse) than regular users. The current study investigated for the first time how synthetic speech is perceived by blind vs. sighted users. For this purpose, we used a speeded AX discrimination task and tested how sighted and blind listeners perceive synthetic speech of different qualities. Results show that blind participants had significantly better discrimination on this task, and both groups performed worse when the perceptual differences in the synthetic speech were smaller. This suggests that blind participants were indeed more sensitive to the acoustic characteristics of synthetic speech compared to their sighted peers. We discuss implications for speech perception and the development of modern speech technologies.

#18 Investigating Range-Equalizing Bias in Mean Opinion Score Ratings of Synthesized Speech

Authors: Erica Cooper ; Junichi Yamagishi

Mean Opinion Score (MOS) is a popular measure for evaluating synthesized speech. However, the scores obtained in MOS tests are heavily dependent upon many contextual factors. One such factor is the overall range of quality of the samples presented in the test -- listeners tend to try to use the entire range of scoring options available to them regardless of this, a phenomenon which is known as range-equalizing bias. In this paper, we systematically investigate the effects of range-equalizing bias on MOS tests for synthesized speech by conducting a series of listening tests in which we progressively "zoom in" on a smaller number of systems in the higher-quality range. This allows us to better understand and quantify the effects of range-equalizing bias in MOS tests.

#19 Mitigating the Exposure Bias in Sentence-Level Grapheme-to-Phoneme (G2P) Transduction

Authors: Eunseop Yoon ; Hee Suk Yoon ; Dhananjaya Gowda ; SooHwan Eom ; Daehyeok Kim ; John Harvill ; Heting Gao ; Mark Hasegawa-Johnson ; Chanwoo Kim ; Chang D. Yoo

Text-to-Text Transfer Transformer (T5) has recently been considered for Grapheme-to-Phoneme (G2P) transduction. As a follow-up, a tokenizer-free byte-level model based on T5, referred to as ByT5, recently gave promising results on word-level G2P conversion by representing each input character with its corresponding UTF-8 encoding. Although it is generally understood that sentence-level or paragraph-level G2P can improve usability in real-world applications, as it is better suited to handling heteronyms and linking sounds between words, we find that using ByT5 for these scenarios is nontrivial. Since ByT5 operates at the character level, it requires longer decoding steps, which deteriorates performance due to the exposure bias commonly observed in auto-regressive generation models. This paper shows that the performance of sentence-level and paragraph-level G2P can be improved by mitigating such exposure bias using our proposed loss-based sampling method.
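
The paper's remedy is a loss-based sampling method; as a hedged illustration of the broader exposure-bias mitigation family, the toy loop below uses plain scheduled sampling (a fixed probability of feeding back the model's own prediction) with a trivial stand-in decoder. Nothing here is the authors' training criterion.

```python
# Scheduled-sampling-style training loop for an autoregressive byte decoder (toy stub).
import random

random.seed(0)

def decoder_step(prev_token, state):
    # Stand-in for one autoregressive step of a ByT5-style decoder.
    return (prev_token + 1) % 256, state

def train_decode(gold_bytes, sample_prob=0.25):
    state, prev, outputs = None, 0, []
    for gold in gold_bytes:
        pred, state = decoder_step(prev, state)
        outputs.append(pred)
        # Sometimes condition the next step on the model's prediction instead of the gold byte,
        # so training-time conditions resemble inference and exposure bias is reduced.
        prev = pred if random.random() < sample_prob else gold
    return outputs

print(train_decode([104, 9, 108, 111, 117]))   # illustrative byte sequence
```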

#20 Streaming Parrotron for on-device speech-to-speech conversion

Authors: Oleg Rybakov ; Fadi Biadsy ; Xia Zhang ; Liyang Jiang ; Phoenix Meadowlark ; Shivani Agrawal

We present a fully on-device streaming Speech2Speech conversion model that normalizes given input speech directly to synthesized output speech. Deploying such a model on mobile devices poses significant challenges in terms of memory footprint and computation requirements. We present a streaming-based approach that achieves an acceptable delay, with minimal loss in speech conversion quality, compared to a reference state-of-the-art non-streaming approach. Our method consists of first running the encoder in streaming mode in real time while the speaker is speaking. Then, as soon as the speaker stops speaking, we run the spectrogram decoder in streaming mode alongside a streaming vocoder to generate output speech. To achieve an acceptable delay-quality trade-off, we propose a novel hybrid approach for look-ahead in the encoder which combines a look-ahead feature stacker with look-ahead self-attention. We show that our streaming approach is 2x faster than real time on the Pixel 4 CPU.

#21 Exploiting Emotion Information in Speaker Embeddings for Expressive Text-to-Speech

Authors: Zein Shaheen ; Tasnima Sadekova ; Yulia Matveeva ; Alexandra Shirshova ; Mikhail Kudinov

Text-to-Speech (TTS) systems have recently seen great progress in synthesizing high-quality speech. However, the prosody of generated utterances is often not as diverse as that of natural speech. In the case of multi-speaker or voice cloning systems, this problem becomes even worse, as information about prosody may be present in both the input text and the speaker embedding. In this paper, we study the presence of emotional information in speaker embeddings, a phenomenon recently revealed for i-vectors and x-vectors. We show that the produced embeddings may include dedicated components encoding prosodic information. We further propose a technique for finding such components and generating emotional speaker embeddings by manipulating them. We then demonstrate that an emotional TTS system based on the proposed method shows good performance and has a smaller number of trained parameters compared to solutions based on fine-tuning.
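
A hedged illustration of manipulating emotion-related components of a speaker embedding: estimate a direction separating emotional from neutral utterances and shift an embedding along it. The paper's actual component-finding procedure may differ; the embeddings and offsets below are synthetic.

```python
# Estimate an "emotion direction" in speaker-embedding space and apply it to an embedding.
import numpy as np

rng = np.random.default_rng(0)
neutral_embs = rng.normal(0.0, 1.0, (200, 192))      # placeholder x-vector-like embeddings
happy_embs = neutral_embs + np.concatenate([np.full(16, 0.8), np.zeros(176)])  # synthetic offset

emotion_dir = happy_embs.mean(axis=0) - neutral_embs.mean(axis=0)
emotion_dir /= np.linalg.norm(emotion_dir)

def emotionalize(speaker_emb, strength=1.0):
    # Push the embedding toward the "happy" region while keeping the rest of it intact.
    return speaker_emb + strength * emotion_dir

print(np.round(emotionalize(neutral_embs[0])[:4], 3))
```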

#22 E2E-S2S-VC: End-To-End Sequence-To-Sequence Voice Conversion

Authors: Takuma Okamoto ; Tomoki Toda ; Hisashi Kawai

This paper proposes end-to-end (E2E) non-autoregressive sequence-to-sequence (S2S) voice conversion (VC) models that extend two E2E text-to-speech models, VITS and JETS. In the proposed E2E-S2S-VC models, VITS-VC and JETS-VC, the input text sequences of VITS and JETS are replaced by the source speaker's acoustic feature sequences, and the E2E models (including HiFi-GAN waveform synthesizers) are trained using monotonic alignment search (MAS) without external aligners. To successfully train MAS for VC, the proposed models use a reduction factor only for the encoder. In the proposed models, the voice of a source speaker is converted directly to that of a target speaker using a single neural network in an S2S manner, so the duration and prosody of the source speech can be converted directly to those of the target speech. The results of experiments using 1,000 parallel utterances of Japanese male and female speakers demonstrate that the proposed JETS-VC outperformed cascaded non-autoregressive S2S VC models.

#23 DC CoMix TTS: An End-to-End Expressive TTS with Discrete Code Collaborated with Mixer

Authors: Yerin Choi ; Myoung-Wan Koo

Despite the huge successes made in neutral TTS, content-leakage remains a challenge. In this paper, we propose a new input representation and a simple architecture to achieve improved prosody modeling. Inspired by the recent success of discrete code in TTS, we introduce discrete code as input to the reference encoder. Specifically, we leverage the vector quantizer from an audio compression model to exploit the diverse acoustic information it has already been trained on. In addition, we apply a modified MLP-Mixer to the reference encoder, making the architecture lighter. As a result, we train the prosody-transfer TTS in an end-to-end manner. We prove the effectiveness of our method through both subjective and objective evaluations. Our experiments demonstrate that the reference encoder learns better speaker-independent prosody when discrete code is utilized as input. In addition, we obtain comparable results even with fewer parameters.

#24 Voice Conversion With Just Nearest Neighbors

Authors: Matthew Baas ; Benjamin van Niekerk ; Herman Kamper

Any-to-any voice conversion aims to transform source speech into a target voice with just a few examples of the target speaker as a reference. Recent methods produce convincing conversions, but at the cost of increased complexity – making results difficult to reproduce and build on. Instead, we keep it simple. We propose k-nearest neighbors voice conversion (kNN-VC): a straightforward yet effective method for any-to-any conversion. First, we extract self-supervised representations of the source and reference speech. To convert to the target speaker, we replace each frame of the source representation with its nearest neighbor in the reference. Finally, a pretrained vocoder synthesizes audio from the converted representation. Objective and subjective evaluations show that kNN-VC improves speaker similarity with similar intelligibility scores to existing methods. Code, samples, trained models: https://bshall.github.io/knn-vc.
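
A minimal sketch of the matching step described in this abstract (not the released implementation, which may differ in details such as averaging several neighbours): every source frame is replaced by its most similar reference frame. Feature extraction and vocoding are out of scope; the arrays are placeholders.

```python
# Frame-wise nearest-neighbour replacement over self-supervised features (kNN-VC idea).
import numpy as np

rng = np.random.default_rng(0)
source_feats = rng.standard_normal((200, 1024))      # e.g. SSL features of the source utterance
reference_feats = rng.standard_normal((800, 1024))   # pooled features of the target speaker

def knn_convert(source, reference):
    # Cosine similarity between every source frame and every reference frame.
    s = source / np.linalg.norm(source, axis=1, keepdims=True)
    r = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    nearest = (s @ r.T).argmax(axis=1)                # most similar reference frame per source frame
    return reference[nearest]

converted = knn_convert(source_feats, reference_feats)
print(converted.shape)                                # (200, 1024), then fed to a pretrained vocoder
```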

#25 CFVC: Conditional Filtering for Controllable Voice Conversion

Authors: Kou Tanaka ; Takuhiro Kaneko ; Hirokazu Kameoka ; Shogo Seki

This paper describes a many-to-many voice conversion model that filters the speaker vector to control high-level attributes such as speaking rate while preserving voice timbre. In order to control only the speaking rate, it is essential to decompose the speaker vector into a speaking rate vector and others. The challenge is to train such disentangled representations with no/few annotation data. Motivated by this difficulty, we propose an approach combining the conditional filtering method with data augmentation. The experimental results showed that our method disentangled complex attributes without annotation and separately controlled speaking rate and voice timbre. Audio samples can be accessed on our web page.